Prediction from Blog Data
Abstract
We have approximately one year's worth of blog posts from [1], covering more than 12 million tracked blogs and averaging about 500,000 posts per day. In this project, we attempt to extract from the blog data the set of features that are predictive of a movie's gross sales, critics' ratings, and viewers' ratings (as collected by sites like [3]). With this insight, we apply machine learning techniques to predict sales and ratings for future movies. This is useful, for example, to see whether the 'buzz' surrounding a movie is sufficient to obtain high sales or whether additional marketing is required. In addition, we could automatically infer the ratings of movies from their mentions in blogs. This could be useful for movie rating sites like Rotten Tomatoes [3] and IMDB [4], either to verify the correctness of ratings or to 'seed' ratings for a new movie or a movie that does not exist in their database. Furthermore, in some instances information such as gross sales may not be publicly available, so we would like to predict it instead.

At a high level, we would like to see whether "online chatter" is useful for analyzing and predicting the quality and popularity of various items, be it movies, songs, books, restaurants, or other commercial products. In our project, for the top movies of 2008 (≈ 200 of them that have non-negligible blog mentions) and the top movies of all time (≈ 250), we try to predict the following: the gross sales, the critics' rating, and the average viewers' rating. We selected a long list of relevant features and populated them by parsing the blog data.

After performing training, cross-validation, and feature and model selection with a basic set of models, we obtain reasonable error rates for most output variables. We find that Naive Bayes and SVM are the best prediction algorithms for our data, and that PCA generally works best as a feature selection method. Finally, we have observed some interesting patterns which suggest that more sophisticated algorithms could make quite accurate predictions.
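The training and cross-validation step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian Naive Bayes classifier, a binary "high"/"low" sales label, and two hypothetical features (log mention count and positive-word ratio) filled with synthetic data, since the abstract does not specify the actual features, label encoding, or model variant.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class log-prior and per-feature mean/variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor on the variance avoids division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


def make_synthetic_movies(n=200, seed=1):
    """Synthetic stand-in data (an assumption, not the blog corpus):
    [log mention count, positive-word ratio] -> 'high' or 'low' sales."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice(["high", "low"])
        centre = (6.0, 0.7) if label == "high" else (3.0, 0.4)
        X.append([rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 0.1)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_movies()
    print(f"5-fold accuracy: {cross_val_accuracy(X, y):.2f}")
```

Held-out accuracy averaged over folds is the quantity one would compare across candidate models and feature sets; with real blog features the score would depend entirely on how those features are defined.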
Similar resources
Examining the Effects of Writing Instruction through Blogging on Second Language Writing Performance and Anxiety
This study investigated the effects of blog-mediated instruction on English-as-a-foreign language (EFL) learners’ writing performance and anxiety. In addition, it aimed to probe into the EFL learners’ attitudes towards blog-mediated writing instruction. The participants of the study included forty-six Iranian EFL learners from two intact university classes, who were randomly assigned to the Con...
Predicting Depression for Japanese Blog Text
This study aims to predict clinical depression, a prevalent mental disorder, from blog posts written in Japanese by using machine learning approaches. The study focuses on how data quality and various types of linguistic features (characters, tokens, and lemmas) affect prediction outcome. Depression prediction achieved 95.5% accuracy using selected lemmas as features.
Effects of Blog-Mediated Writing Instruction on L2 Writing Motivation, Self-Efficacy, and Self-Regulation: A Mixed Methods Study
Employing an explanatory sequential design, this study investigated the effects of a blog-mediated writing course on L2 students’ writing motivation, self-efficacy, and self-regulation. A total of 46 Iranian EFL learners from 2 intact university classes were recruited as participants and were randomly assigned to the control group (n = 21) and the experimental group (n ...
Wildlife Damage Estimation and Prediction Using Blog and Tweet Information
Wildlife damage estimation and prediction using blog and tweet information is conducted. Through a regressive analysis with the truth data about wildlife damage which is acquired by the federal and provincial governments and the blog and the tweet information about wildlife damage which are acquired in the same year, it is found that some possibility for estimation and prediction of wildlife da...
Predicting Author Blog Channels with High Value Future Posts for Monitoring
The phenomenal growth of social media, both in scale and importance, has created a unique opportunity to track information diffusion and the spread of influence, but can also make efficient tracking difficult. Given data streams representing blog posts on multiple blog channels and a focal query post on some topic of interest, our objective is to predict which of those channels are most likely ...
An Investigation of Iranian EFL University Learners’ Creative Thinking and Critical Thinking Skills in a Pedagogical Blog: A Mixed-Methods Approach
The present study explored the effect of a pedagogical blog on Iranian EFL learners’ creative and critical thinking skills using a mixed-methods approach. In the pedagogical blog, the researchers asked learners divergent and evaluative questions based on Lindley’s model (1993). The quantitative data were collected by administering Creativity Test Questionnaire (ATC) and ...
Publication date: 2008